An integrated, fast and scalable approach for large-scale biological network analysis

Arefin, Ahmed Shamsul

Title: An integrated, fast and scalable approach for large-scale biological network analysis
Creator: Arefin, Ahmed Shamsul
Relation: University of Newcastle Research Higher Degree Thesis
Resource Type: thesis
Date: 2013
Description: Research Doctorate - Computer Science
Description: The amount of data in our world has been exploding. Computer-based methods used to analyze data ten years ago are impractical today, as continuously evolving data acquisition technologies produce more raw data than these methods can handle. For instance, high-throughput technologies such as DNA microarrays can produce millions of data elements from a single experiment, whereas most of the relevant analysis tools are designed to work with only a few tens of thousands. Even though the scalability of these methods and tools may be improved by porting their implementations to an expensive supercomputer or a cluster of computers, their fully connected data representation model can still impose many other restrictions. In this work, instead of using the traditional distance-matrix-based microarray data analysis model, we propose a novel, fast and scalable κ-Nearest Neighbor (κNN) graph-based approach. Moreover, instead of constructing the graph/network on an expensive system, we show its construction on graphics processing units (GPUs), which are now widely available as inexpensive, highly parallel devices. The outcome of our κNN graph construction method (termed GPU-FS-κNN) can be used to carry out many other important computational tasks. In particular, we demonstrate its application in two popular data analysis methods: clustering and centrality analysis. To do this, we first propose a fast GPU-based method for constructing minimum spanning trees (MSTs) from the κNN graphs (termed κNN-Borůvka) and a method for partitioning the trees in an agglomerative fashion (termed κNN-Borůvka-Agglomerative). Then, we demonstrate the use of κNN graphs in accelerating and scaling the computation of two degree-based (degree and eigenvector) and three shortest-path-based (closeness, eccentricity and betweenness) centrality metrics.
Finally, we integrate the developed methods and apply them together to two publicly available gene-expression data sets (Alzheimer’s disease and breast cancer) and their large-scale artificial expansions. Our investigations show that the proposed integrated approach can find both numerically and biologically significant results. We also demonstrate the method’s application in extracting a robust set of gene markers that may warrant further investigation, due to their conspicuous positions in our results.
Subject: data clustering; centrality analysis; GPU-based computation; microarray-based data analysis
Identifier: http://hdl.handle.net/1959.13/938499
Identifier: uon:12629
Language: eng
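
Illustrative note: the abstract describes building a κNN graph and then extracting a minimum spanning tree from it with a Borůvka-style method. The following is only a minimal CPU sketch of that general idea in plain Python, not the thesis's GPU implementation (GPU-FS-κNN or κNN-Borůvka). The function names `knn_graph` and `boruvka_forest`, the Euclidean distance, and the toy points are assumptions made for illustration. Because a κNN graph may be disconnected, the sketch returns a spanning forest rather than a single tree.

```python
import math


def knn_graph(points, k):
    """Brute-force kNN graph (assumed Euclidean distance).

    Returns a sorted list of undirected edges (weight, u, v) with u < v.
    """
    edges = {}
    for i, p in enumerate(points):
        # Distances from point i to every other point, nearest first.
        dists = sorted((math.dist(p, q), j) for j, q in enumerate(points) if j != i)
        for d, j in dists[:k]:
            edges[(min(i, j), max(i, j))] = d
    return sorted((d, u, v) for (u, v), d in edges.items())


def boruvka_forest(n, edges):
    """Boruvka's algorithm on a sparse edge list; returns spanning-forest edges."""
    parent = list(range(n))

    def find(x):
        # Union-find lookup with path halving.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    forest = []
    while True:
        # For every component, find its cheapest outgoing edge.
        # Comparing whole (weight, u, v) tuples breaks ties consistently.
        cheapest = {}
        for e in edges:
            _, u, v = e
            ru, rv = find(u), find(v)
            if ru == rv:
                continue
            for r in (ru, rv):
                if r not in cheapest or e < cheapest[r]:
                    cheapest[r] = e
        if not cheapest:
            break  # no edge joins two different components any more
        for w, u, v in cheapest.values():
            ru, rv = find(u), find(v)
            if ru != rv:
                parent[ru] = rv
                forest.append((w, u, v))
    return forest


# Toy data (hypothetical): two well-separated groups of three points.
points = [(0, 0), (1, 0), (0, 1), (10, 10), (11, 10), (10, 11)]
edges = knn_graph(points, k=2)
forest = boruvka_forest(len(points), edges)
print(sorted(forest))
```

With k = 2 the two groups are not linked by any κNN edge, so the result is one small tree per group. This mirrors why a sparse κNN graph is much cheaper to store than a full distance matrix, at the cost of possibly yielding a forest instead of one tree.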
Full Text

File	Description	Size	Format
ATTACHMENT01	Abstract	186 KB	Adobe Acrobat PDF
ATTACHMENT02	Thesis	13 MB	Adobe Acrobat PDF